Abstract: Link failures in wide area networks are common. 
To recover from such failures, a number of methods such as 
SONET rings, protection cycles, and source rerouting have been 
investigated. Two important considerations in such approaches 
are the recovery time and the needed spare capacity to complete 
the recovery. Usually, these techniques attempt to achieve a 
recovery time less than 50 ms. In this paper we introduce an 
approach that provides link failure recovery in a hitless manner, 
or without any appreciable delay. This is achieved by means of a 
method called diversity coding. We present an algorithm for the 
design of an overlay network to achieve recovery from single 
link failures in arbitrary networks via diversity coding. This 
algorithm is designed to minimize spare capacity for recovery. 
We compare this algorithm against conventional techniques in terms of 
recovery time, spare capacity, and a joint metric called Quality 
of Recovery (QoR). QoR incorporates both the spare capacity 
percentages and worst case recovery times. Based on these results, 
we conclude that the proposed technique provides much shorter 
recovery times while achieving similar extra capacity, or better 
QoR performance overall. 

I. Introduction 

Failures in communication networks are common and can 
result in substantial losses. For example, in the late 1980s, 
the AT&T telephone network encountered a number of highly 
publicized failures [1], [2]. In one case, much of the long 
distance service along the East Coast of the U.S. was disrupted 
when a construction crew accidentally severed a major fiber 
optic cable in New Jersey. As a result, 3.5M call attempts 
were blocked [1]. On another occasion, of the 148M calls 
placed during the nine-hour-long period of the failure, only 
half went through, resulting in tens of millions of dollars worth 
of collateral damage for AT&T as well as many of its major 
customers [2]. 

Observing that such wide-scale network failures can have a 
huge impact, in February 1992, the Federal Communications 
Commission (FCC) of the U.S. issued an order requesting 
that carriers report any major outages affecting more than 
50K customers lasting for more than 30 minutes. Over a 
decade, the reports made available to the public showed 
that network failures are very common and cause significant 
service interruptions. According to the publicly available data, 
while most of the reported events impacted up to 250K users, 
some impacted millions of users [3]. 

In the early 1990s, AT&T decided to address the restoration 
problem for its long distance network with an automatic 
centrally controlled mesh recovery scheme, called FASTAR, 
based on digital cross-connect systems [4]. Since then, this 
subject has seen a significant amount of research. In mesh- 
based network link failure recovery, the two nodes at the end of 
the failed link can switch over to spare capacity. Alternatively, 
all the affected paths could be switched over to spare capacity 
in a distributed fashion. While the former is faster, the latter 
will have smaller spare capacity requirement. In this paper, we 
will use the term source rerouting to refer to mesh-based link 
or path protection algorithms. In simulations we employ the 
Simplest Spare Capacity Allocation (SSCA) algorithm [5]. 

In the mid-1990s, specifications for an automatic protec- 
tion capability within the Synchronous Optical Networking 
(SONET) transmission standard were developed. These later 
became the International Telecommunications Union (ITU) 
standards G.707 and G.708. The basic idea for protection 
is to provide 100% redundant capacity on each transmission 
path through employment of ring structures. SONET can 
accomplish fast restoration (telephone networks have a goal 
of restoration within 50 ms after a failure to keep perception 
of voice quality unchanged by human users) at the expense of 
a large amount of spare capacity [6], [7]. The restoration times 
for mesh-based rerouting techniques are typically larger than 
those of SONET rings; however, the extra transport capacity 
they require for restoration in the U.S. is generally smaller than 
that required by SONET rings. In the late 1990s, with other 
major U.S. long distance carriers moving to SONET rings for 
restoration purposes, an industry-wide debate took place as 
to whether the mesh-based restoration or the SONET ring- 
based restoration is better. This debate still continues today. 
Although most researchers accept that mesh-based restoration 
may save extra capacity, restoration speeds achievable with 
mesh-based restoration are generally low and the signaling 
protocols needed for message feedback are an extra complexity 
that can also complicate the restoration process. 

An extension of the SONET rings is the technique known 
as p-cycles [8]. In a network, a p-cycle is a ring that goes 
through all the nodes once. Such a ring will provide protection 
against any single link failure in the network because there is 
always an alternative path on the ring that connects the nodes 
at the end of the failed link, unaffected by the failure. The 
recovery is carried out by the two nodes that detect the failure 
at the two ends of the failed link. These nodes reroute the 
traffic on this link to the corresponding part of the p-cycle. 
Constructing p-cycles and assigning the corresponding spare capacity 
can be carried out by a number of algorithms [8]. Some 
of these algorithms employ linear programming, while others 
are simpler design algorithms. In this paper, we 
employ the algorithm in [8, p. 699], which is considered to 
be within 5% of the optimal solution [8]. We would like to 
add that in the technique of p-cycles, it is possible to subdivide 
the network nodes and generate different p-cycle rings for each 
division separately [8]. 

Fig. 1. Diversity coding where N parallel data links are protected against 
failure by one coded link. (a) Encoder and (b) Decoder. 

Recovery from link failures in IP networks can take a long 
time [7] because IP routing protocols were not designed to 
minimize network outages. There has been Internet research 
that shows a single link failure can cause users to experience 
outages of several minutes even when the underlying network 
is highly redundant with plenty of spare bandwidth available 
and with multiple ways to route around the failure [7]. Needless 
to say, depending on the application, outages of several 
minutes are not acceptable, for example, for IP telephony, e- 
commerce, or telemedicine. 

Within the telephony transmission and networking commu- 
nity, hitless restoration from failures is often described as 
an ideal [8]. Nevertheless, with the methods considered, it 
could not be achieved because these methods are based on 
message feedback and rerouting, both of which take time. 
In contrast, with our method, hitless or near-hitless recovery 
from single link failures becomes possible given delay buffers 
that synchronize the paths. This introduces no appreciable 
transmission delay. The basic technique is powerful enough 
that it can be extended to other network failures such as 
multiple link or node failures. 

The concept of a Quality of Recovery (QoR) metric was 
introduced previously in order to find an overall metric that 
evaluates the performance of a protection technique and compares 
it with others; see, e.g., [9]. The arguments of the QoR 
metric depend on the problem and its application. In this 
paper, we employ a version of the QoR metric from [9] that 
incorporates spare capacity percentage, restoration time, and 
data loss. 

II. Diversity Coding 

The basic idea in diversity coding is given in Fig. 1 [10], 
[11]. Here, digital links of equal rate $d_1, d_2, \ldots, d_N$ are 
transmitted over disjoint paths to their destination. For the 
sake of simplicity, assume that these links have a common 
source and a common destination, and have the same length. 
A coded link carrying the modulo-2 sum 

$$c_1 = d_1 \oplus d_2 \oplus \cdots \oplus d_N = \bigoplus_{j=1}^{N} d_j$$

is transmitted over another equal length disjoint path. In the 
case of the failure of link $d_i$, the receiver can immediately 
form 

$$c_1 \oplus \bigoplus_{j=1,\, j \neq i}^{N} d_j = d_i \oplus \bigoplus_{j=1,\, j \neq i}^{N} (d_j \oplus d_j) = d_i$$

since it has $d_1, d_2, \ldots, d_{i-1}, d_{i+1}, \ldots, d_N$ available and 
$d_j \oplus d_j = 0$ in modulo-2 arithmetic, i.e., the logical XOR 
operation. As a result, $d_i$ is recovered by employing $c_1$ and 
$d_1, d_2, \ldots, d_{i-1}, d_{i+1}, \ldots, d_N$. It is important to recognize 
that this recovery is accomplished in a feedforward fashion, 
without any message feedback or rerouting. 
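
As a minimal sketch of the feedforward recovery described above, the following Python snippet treats the data on each link as a block of bytes, forms the coded link as the bytewise XOR of the N data links, and rebuilds a single failed link from the survivors. The function names `xor_blocks`, `encode`, and `recover`, as well as the sample payloads, are illustrative assumptions and are not taken from [10], [11].

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise modulo-2 sum (XOR) of equal-length byte blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode(data_links):
    """Form the coded link c_1 = d_1 XOR d_2 XOR ... XOR d_N."""
    return xor_blocks(data_links)


def recover(received, coded):
    """Rebuild the single missing link (marked None) from the survivors and c_1."""
    survivors = [d for d in received if d is not None]
    if len(survivors) != len(received) - 1:
        raise ValueError("exactly one data link must be missing")
    # XOR of c_1 with all surviving links cancels every d_j except the failed one.
    return xor_blocks(survivors + [coded])


# Illustrative payloads (assumed); all links carry blocks of equal length.
links = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
c1 = encode(links)
received = [links[0], None, links[2]]  # the second data link has failed
restored = recover(received, c1)
```

No request is sent back to the source in this sketch: the receiver only combines what it already holds, which is the property that distinguishes diversity coding from rerouting.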

We assumed above that the sources and the destinations 
of $d_1, d_2, \ldots, d_N, c_1$ are the same. Diversity coding can 
actually be extended to network topologies where the source 
or the destination node is not common. Some examples of 
such network topologies include multi-point to point, point 
to multi-point, and multi-point to multi-point connections. In 
some cases, there may be a designated encoding node and 
a designated decoding node, whereas in other cases the 
encoding operation can be carried out in source nodes in an 
incremental fashion. The decoding operation can also be done only 
at destination nodes, instead of a designated intermediate node. 
Examples of these topologies can be found in [10], [11], [12]. 

Diversity coding papers [10], [11] predate the work that 
relates the multicast information flow in networks to the minimum 
cut properties of the network [13] by about a decade. 
This latter work has given rise to the general area of network 
coding. However, in network coding, discovery of optimal 
techniques to achieve multiple unicast routing in general 
networks has remained elusive. In this paper, we provide a 
systematic approach to the related problem of designing an 
overlay network for link failure recovery in arbitrary networks, 
based on [10], [11]. 

As stated above, the main advantage of diversity coding as 
a recovery technique against failures in networks is the fact 
that it does not need any feedback messaging. In contrast, mesh-based 
source rerouting techniques, SONET rings, and the 
technique of p-cycles do need signaling protocols to complete 
rerouting. With diversity coding, as soon as the failure is 
detected, the data can be immediately recovered. As in network 
coding, this requires synchronization of the coded streams. 
We refer the reader to [10] for a description of the need 
for synchronization as well as how to achieve it in diversity 
coding. 

A. Example 1 

We will now provide a simple example regarding the use of 
diversity coding for link failure recovery. Consider the network 
in Fig. 2(a). This network has a topology similar to the well-known 
butterfly network commonly used to illustrate the basic 
concept of multicasting via network coding, which first appeared in 
[13]. 

Fig. 2. A simple example for link failure recovery via diversity coding. 

In this example, the source node $S_1$ wishes to transmit its 
data A to destination node $D_1$ and the source node $S_2$ wishes 
to transmit its data B to destination node $D_2$, shown by solid 
lines. The restoration network is shown via dashed lines. There 
is an encoder on top which forms $A \oplus B$, which we show as 
A + B. This data is then transmitted to the decoder node. The 
decoder forms the summation of the data received from the 
encoder and the two destination nodes. In the case of failures, 
some of these data will not be present. However, the network 
is designed such that the destination node will automatically 
receive the missing data from the restoration network. 
In this example, the central decoder does 
not carry out any failure detection. This task is carried out by 
the destination nodes $D_1$ and $D_2$ as described below. 

In the case of regular operation, the destination nodes 
receive their data from their data links and receive "0" from 
the restoration network, as shown in Fig. 2(a). Assume the link 
from $S_1$ to $D_1$ carrying data A failed. In this case, both of 
the nodes $D_1$ and $D_2$ receive data A automatically from the 
restoration network, as shown in Fig. 2(b). Node $D_1$ uses this 
data instead of what it should have been receiving directly 
from node $S_1$. Since node $D_2$ is receiving its regular data B 
directly from $S_2$, it ignores the data transmitted by the central 
decoder. The symmetric failure case for the link from $S_2$ to $D_2$ 
is shown in Fig. 2(c). Other failure scenarios will be ignored 
by $D_1$ and $D_2$ since in those cases they receive their data 
directly from the respective sources $S_1$ and $S_2$. An example 
of this latter mode of operation is depicted in Fig. 2(d). 
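
The behavior of this example can be sketched in a few lines of Python. This is a simplified model under stated assumptions: data units are small integers, data that is missing because of a failed link is modeled as 0, and only the two source-to-destination links can fail. The function names and the sample values are hypothetical.

```python
def restoration_output(a, b, s1_d1_up=True, s2_d2_up=True):
    """Value the central decoder sends to both destination nodes.

    The decoder sums (XORs) the coded data A xor B from the encoder with
    whatever the two destination nodes received on their direct links.
    """
    at_d1 = a if s1_d1_up else 0  # missing data is modeled as 0
    at_d2 = b if s2_d2_up else 0
    return (a ^ b) ^ at_d1 ^ at_d2


def deliver(a, b, s1_d1_up=True, s2_d2_up=True):
    """Data finally used by D1 and D2, plus the restoration network output."""
    r = restoration_output(a, b, s1_d1_up, s2_d2_up)
    # A destination uses the restoration data only if its own link failed.
    d1 = a if s1_d1_up else r
    d2 = b if s2_d2_up else r
    return d1, d2, r


A, B = 0x5A, 0xC3                               # illustrative data units
normal = deliver(A, B)                          # restoration network carries 0
link_a_failed = deliver(A, B, s1_d1_up=False)   # restoration network carries A
link_b_failed = deliver(A, B, s2_d2_up=False)   # restoration network carries B
```

In this model the decoder performs the same operation in every case; the only decision, whether to use the restoration data, is taken locally at the destination whose link failed.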

B. Example 2 

In this example, we will show that diversity coding can 
result in less spare capacity than source rerouting or p-cycles. 
Refer to Fig. 3(a). This figure shows the available topology of 
the network. In this network, each link is bidirectional. There 
are 4 unit rate flows in the network represented as a, b, c, and 
d, where a and b are from node 1 to node 4, c is from node 
2 to node 4, and d is from node 3 to node 4. The solution 
for diversity coding is shown in Fig. 3(b). In this solution, 
path 1-5-4 is a spare link for the protection of either a or 
b, and it carries the modulo-2 sum of a and b. Note that coded 
signals are not limited to spare links, as the modulo-2 sum 
of a and c is delivered over the primary path of a. The same 
applies to the primary path of b. The signals are coded in 
such a way that any four of the five links incoming to node 4 
form a full-rank matrix, so node 4 can derive a, b, c, and d 
if the remaining link fails. Fig. 3(c) represents the best solution 
in the case of source rerouting. The upper link is used to 
protect any failure in transmitting a, b, or c. The lower link 
between nodes 3 and 4 is used to protect flow d. Unlike 
the previous case, we need two units of capacity over the 
upper link between nodes 3 and 4 due to the separate transmission 
of b and d. The best solution for p-cycles is given in 
Fig. 3(d). In this solution, there is only one ring that protects 
every signal. Due to the intermediate node 5, the p-cycles 
solution cannot offer protection for the path 1-5-4. Protection 
capacity p, which is unit rate, is reserved on the cycle to carry 
any failed signal a, b, c, or d, since a failure affects at most 
one of these signals. This guarantees full operation after any 
failure recovery, at an extra cost with respect to diversity 
coding. Clearly, in this example, both source rerouting and 
p-cycles require more spare capacity than diversity coding. 

Fig. 3. Spare capacity comparison example. 
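
The full-rank condition at node 4 can be checked mechanically over GF(2). The sketch below is illustrative only: the text states that the spare path carries the sum of a and b and that the primary path of a carries the sum of a and c, while the remaining three coding vectors (the sum of b and d on the primary path of b, and uncoded c and d) are assumptions made here to complete the example. Each vector is written as a 4-bit mask over (a, b, c, d).

```python
from itertools import combinations


def gf2_rank(rows):
    """Rank over GF(2) of row vectors given as integer bitmasks."""
    basis = []
    for row in rows:
        for b in basis:
            # XOR with b only if that clears b's leading bit in row.
            row = min(row, row ^ b)
        if row:
            basis.append(row)
    return len(basis)


# Bit order: a = 0b1000, b = 0b0100, c = 0b0010, d = 0b0001.
incoming_links = {
    "a+c (primary path of a)": 0b1010,
    "b+d (primary path of b, assumed)": 0b0101,
    "a+b (spare path 1-5-4)": 0b1100,
    "c (assumed uncoded)": 0b0010,
    "d (assumed uncoded)": 0b0001,
}

# For every single link failure, the four surviving links must have rank 4.
survivor_ranks = [gf2_rank(subset)
                  for subset in combinations(incoming_links.values(), 4)]
```

Under these assumed vectors every entry of `survivor_ranks` equals 4, so node 4 could solve for a, b, c, and d after the loss of any one incoming link.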

III. Coding in Networks with Arbitrary Topology 

We will now apply the technique described in the previous 
section to the design of an overlay network for recovery from 
link failures in arbitrary networks. We approach this problem 
by examining all possible combinations of standard diversity 
coding [10], [11], [12]. In doing this, our goal is to come 
up with a network for which the spare capacity introduced 
due to diversity coding is minimized. We employ the redundancy 
ratio [12] as the metric that quantifies the efficiency of 
a particular combination chosen. The redundancy ratio measures 
the extra capacity introduced in diversity coding. Due to space 
limitations, we refer the reader to [12] for its definition. 

A. Proposed Algorithm 

We will now discuss how we utilize the redundancy ratio 
of each diversity coding combination in designing an efficient 
diversity coding scheme for a network with arbitrary topology. 

The proposed algorithm is intended to search all possible 
diversity coding combinations and select those with the 
smallest redundancy ratio. To that end, we employ a variable 
called Threshold. The threshold begins with a small value 
(ThrsdLow). Diversity coding combinations of N working 
paths with redundancy ratio values smaller than Threshold 
are accepted, and then Threshold is incremented up to its 
maximum value (ThrsdHgh). Within this process, the value N 
is decremented from a maximum of $N_{\max}$ down to 2. The set 
of unprotected paths is called the DemandMatrix, and when 
N working paths satisfying the redundancy ratio are found, 
they are taken out of DemandMatrix. At the end, a number of 
paths may remain uncoded. We protect every such path by a 
dedicated spare path which carries the same data, known as 
1+1 APS (Automatic Protection Switching). 

A description of the algorithm is given under the heading 
Algorithm I. In our simulations for this paper, the numerical 
values used are ThrsdLow = 1.6, ThrsdHgh = 3.0, and 
$N_{\max} = 4$. 
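
The threshold sweep described above can be sketched in Python as follows. This is a rough illustration under several assumptions, not the simulation code used for the results: the redundancy ratio is defined in [12] and is therefore passed in as a caller-supplied function, the threshold step of 0.1 is assumed because the text does not state it, and candidate groups are enumerated as plain subsets of flows without checking that link-disjoint paths exist. The name `assign_codes` and the toy ratio are hypothetical.

```python
from itertools import combinations


def assign_codes(flows, redundancy_ratio, thrsd_low=1.6, thrsd_high=3.0,
                 step=0.1, n_max=4):
    """Greedy grouping of flows into diversity coding groups.

    Returns (coded_groups, flows_left_for_1plus1_aps).
    """
    flows = list(flows)
    demand = set(flows)                 # DemandMatrix: still-unprotected flows
    coded_groups = []
    n_steps = int(round((thrsd_high - thrsd_low) / step))
    for k in range(n_steps + 1):
        threshold = thrsd_low + k * step
        for n in range(n_max, 1, -1):   # N = n_max, ..., 3, 2
            for group in combinations(flows, n):
                if not demand.issuperset(group):
                    continue            # some flow is already protected
                if redundancy_ratio(group) < threshold:
                    coded_groups.append(group)
                    demand.difference_update(group)
    # Whatever is left is protected individually by 1+1 APS.
    aps_flows = [f for f in flows if f in demand]
    return coded_groups, aps_flows


# Toy usage: flows are (name, destination) pairs. The ratio below is a
# made-up stand-in, NOT the redundancy ratio defined in [12].
def toy_ratio(group):
    same_destination = len({dst for _, dst in group}) == 1
    return (len(group) + 1) / len(group) if same_destination else float("inf")


flows = [("a", 4), ("b", 4), ("c", 4), ("d", 7)]
groups, aps = assign_codes(flows, toy_ratio)
```

In the toy run, the three flows that share a destination are grouped at the first threshold value, and the remaining flow falls through to 1+1 APS. Starting with a low threshold means that the most capacity-efficient groups are fixed first, and less efficient ones are accepted only for flows that are still unprotected.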

IV. Performance Metrics SCP, RT, and QoR 

There are two dominant factors that specify the performance 
of a protection technique. These are spare capacity percentage 
SCP and restoration time RT. The QoR metric combines 
these quantities into a single one and presents a clearer com- 
parison among restoration techniques. The values of SCP are 
calculated via simulations over sample networks and traffic, 
which are given in the next section. We employ the following 
formula for calculating SCP in all simulations: 

$$\mathrm{SCP} = \frac{\text{Total Capacity} - \text{Shortest Working Capacity}}{\text{Shortest Working Capacity}}.$$

Shortest Working Capacity is the total capacity when there 
is no protection and the traffic is routed over shortest paths. 
The restoration time RT is defined as the longest duration 
that the connection is lost during the recovery process. RT is 
calculated by a modified version of a formula from [14]. For 
source rerouting, p-cycles, and diversity coding algorithms, the 
following formulas, respectively, are used to calculate the 
restoration time: 

$$RT_{sr} = F + nP + (n+1) \cdot D + (m+1) \cdot C + 3 \cdot P + 3 \cdot (m+1) \cdot D + EP$$
$$RT_{pc} = F + (n+1) \cdot D + 2 \cdot C + P + EP$$
$$RT_{dc} = F + 2 \cdot D + PD.$$
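
The three expressions can be written as plain functions, which makes it easy to see which quantities each technique depends on. This is a minimal sketch: the function and argument names are illustrative assumptions, all times are taken to be in seconds, `nP` is passed as a single quantity (the propagation time of the error signal to the source-end node), and the value D = 100 µs in the usage line is an assumption chosen only for illustration.

```python
def rt_source_rerouting(F, nP, n, D, m, C, P, EP):
    """Restoration time for source rerouting (all times in seconds)."""
    return F + nP + (n + 1) * D + (m + 1) * C + 3 * P + 3 * (m + 1) * D + EP


def rt_p_cycles(F, n, D, C, P, EP):
    """Restoration time for p-cycles (all times in seconds)."""
    return F + (n + 1) * D + 2 * C + P + EP


def rt_diversity_coding(F, D, PD):
    """Restoration time for diversity coding; note that C does not appear."""
    return F + 2 * D + PD


# Illustrative values: F = 100 us, assumed D = 100 us, equalized paths (PD = 0).
rt_dc_equalized = rt_diversity_coding(F=100e-6, D=100e-6, PD=0.0)
```

With these values the diversity coding expression evaluates to 300 µs. Since the switch configuration time C and the hop counts n and m are not arguments of `rt_diversity_coding`, its result cannot grow with them, unlike the other two expressions.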

Algorithm I: Code Assignment for Link Failure 
Recovery via Diversity Coding 

for Threshold = ThrsdLow to ThrsdHgh do 
    for all combinations of N = N_max, ..., 3, 2 do 
        if redundancy ratio of combination < Threshold then 
            if flow_1, ..., flow_N ∈ DM then 
                for i = 1 to N do 
                    DM = DM - {flow_i} 
                end 
                Update the total, working, and spare capacities 
            end 
        end 
    end 
end 
for all flow_k ∈ DM do 
    Apply 1+1 APS protection 
    DM = DM - {flow_k} 
    Update the total, working, and spare capacities 
end 

As in [14], we use F: the time to detect a failure, D: 
node message processing time, C: time to configure a network 
switch, m: number of hops in the backup route, and n: number 
of hops from the source node of the failed link to the source. 
P is the propagation time for the protection path, EP is the 
propagation time of the failure to the closest node, and nP is the 
propagation time until the error signal reaches the source-end 
node. In addition, PD denotes the propagation delay difference 
between link-disjoint paths in the diversity coding scheme. As 
in [15], we set F to 100 µs. Similarly to [14], we set the 
variable C to a number of values, namely 500 µs, 1 ms, 5 ms, and 
10 ms. The particular form of the QoR metric we employ is 
based on [9]. We define the contributions due to RT and SCP 
as 

$$Q_{RT} = \frac{1}{1 + 400 \cdot RT^2}, \qquad Q_{SCP} = \frac{1}{1 + (SCP/100)^2}$$

where RT is in seconds and the factor 400 accounts for setting 
$Q_{RT} = 1/2$ for RT = 50 ms [9]. Similarly, normalization 
with 100 is to set $Q_{SCP} = 1/2$ when SCP = 100. Finally, 
we incorporate restoration time, data loss, and spare capacity 
into the QoR metric as 

$$QoR = \frac{2 \cdot Q_{RT} + Q_{SCP}}{3}$$

where the factor 2 accounts for both restoration time and data 
loss, which is proportional to RT [9]. 
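
As a quick check of the normalization constants, the following worked computation evaluates the metric at the two reference points named in the text. It assumes the forms $Q_{RT} = 1/(1 + 400 \cdot RT^2)$, $Q_{SCP} = 1/(1 + (SCP/100)^2)$, and a QoR that averages the three contributions (restoration time, data loss, and spare capacity); the quadratic exponent in $Q_{SCP}$ and the divisor 3 are assumptions of this sketch.

```latex
\begin{align*}
RT = 50\ \text{ms} = 0.05\ \text{s}: \quad
  & 400 \cdot RT^2 = 400 \cdot 0.0025 = 1
    \;\Rightarrow\; Q_{RT} = \frac{1}{1 + 1} = \frac{1}{2}, \\
SCP = 100: \quad
  & (SCP/100)^2 = 1
    \;\Rightarrow\; Q_{SCP} = \frac{1}{1 + 1} = \frac{1}{2}, \\
  & QoR = \frac{2 \cdot \tfrac{1}{2} + \tfrac{1}{2}}{3} = \frac{1}{2}.
\end{align*}
```

A technique that just meets the 50 ms target with 100% spare capacity therefore sits at QoR = 1/2 under these assumptions, and values closer to 1 indicate faster or leaner recovery.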

V. Simulation Results 

In this section, we will present simulation results for link 
recovery techniques previously discussed, in terms of their 
spare capacity requirements and their restoration times. 

The first network studied is the European COST 239 network, 
whose topology is given in Figure 4 [16]. In this graph 
as well as the others in the sequel, the numbers associated with 
the nodes represent a node index, while the numbers associated 
with the edges correspond to the distance associated with the 
edge. The traffic demand is adopted from [16] and applied 
to the simulation. This network was previously studied in the 
context of link failure recovery [8]. We provide SCP and RT 
results for the three schemes in Table I. 

Fig. 4. European COST 239 network. Distances are in km. 

TABLE I 
COST 239 Network 

COST 239 Network, 11 nodes, 26 spans 

Scheme            SCP     RT (ms) for different C values 
                          0.5 ms    1 ms      5 ms      10 ms 
Div. Coding       98%     4.8       4.8       4.8       4.8 
Source Rerout.    90%     39.8      41.8      57.8      77.8 
p-cycles          64%     26.1      27.1      35.1      45.1 

Fig. 5. U.S. long distance network. Distances are in tens of miles. 

The second network is based on the U.S. long-haul optical 
network. The topology of this network is shown in Figure 5. It 
is based on the topology given in [5]. In order to calculate the 
traffic, we employed a gravity-based model [17] and assumed 
that the traffic between two nodes is directly proportional to the 
product of the populations of the locations represented by these 
nodes. Values of SCP and RT for this network are given in 
Table II. 

TABLE II 
U.S. Long Distance Network 

U.S. Long Distance Network, 28 nodes, 45 spans 

Scheme            SCP     RT (ms) for different C values 
                          0.5 ms    1 ms      5 ms      10 ms 
Div. Coding       106%    9.5       9.5       9.5       9.5 
Source Rerout.    91%     79.7      83.7      115.7     155.7 
p-cycles          107%    59.6      60.6      68.6      78.6 

The third network is one that favors diversity coding over 
the other two approaches in terms of spare capacity. We came 
up with this network in order to provide a different example 
than the two previous networks. The topology of this network 
is given in Fig. 6. The demand in this network is set such that 
it is symmetric and most of it originates from and terminates 
at the two end nodes 1 and 9 [12]. The values of SCP and 
RT are provided in Table III. For this network, the best spare 
capacity results are obtained by employing the diversity coding 
approach, similarly to the case we showed in the example in 
Section II.B. 

Fig. 6. Synthetic network. Distances are in miles. 

TABLE III 
Synthetic Network 

Synthetic Network, 9 nodes, 20 spans 

Scheme            SCP     RT (ms) for different C values 
                          0.5 ms    1 ms      5 ms      10 ms 
Div. Coding       81%     0.9       0.9       0.9       0.9 
Source Rerout.    100%    5.7       8.2       28.2      53.2 
p-cycles          85%     2.8       3.8       11.8      21.8 

Comparing the values of SCP for the three networks in 
Tables I-III, we observe that the three techniques achieve all 
possible SCP performance orderings: each technique ranks first in 
one network, second in another, and third in the remaining one. 
On the other hand, in terms of the RT performance, 
the proposed technique is always substantially better. 
As can be observed, the improvement in RT performance 
can be close to or even more than an order of magnitude. 
It is worthwhile to observe that for the U.S. Long Distance 
network, the RT values with source rerouting or p-cycles are 
above the critical threshold of 50 ms for all values of C, the 
network switch reconfiguration time. For this network, values 
of RT are well below the 50 ms threshold when diversity 
coding is employed. We would like to note that RT values 
for diversity coding can be reduced even further. Recall that 
$RT_{dc} = F + 2 \cdot D + PD$. PD becomes equal to zero if there is 
one destination node and delay equalization is performed. In 
this case, RT will become very small, about 300 µs, making 
the diversity coding alternative nearly hitless. 

Fig. 7. Quality of Recovery metric results for all three techniques. (a) COST 
239 and U.S. Long Distance network. (b) Synthetic network. 

As discussed earlier, it is possible to combine RT and SCP 
into a single metric QoR. Fig. 7 shows values of QoR for the 
three networks. The results show that QoR for diversity coding 
is better than the other techniques for all of the networks and 
for all possible values of the variable C. Note that while the 
QoR performance of source rerouting and p-cycles becomes 
worse as C increases, that of diversity coding is independent 
of C, because there is no rerouting involved. 
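
The comparison can be reproduced approximately from the SCP and RT values reported in Tables I-III. The sketch below is not the code used to produce Fig. 7. It assumes the forms $Q_{RT} = 1/(1 + 400 \cdot RT^2)$ with RT in seconds, $Q_{SCP} = 1/(1 + (SCP/100)^2)$, and $QoR = (2 \cdot Q_{RT} + Q_{SCP})/3$; the quadratic exponent in $Q_{SCP}$ and the divisor 3 are assumptions, and all function and variable names are illustrative.

```python
def q_rt(rt_seconds):
    """Restoration time contribution; equals 1/2 at RT = 50 ms."""
    return 1.0 / (1.0 + 400.0 * rt_seconds ** 2)


def q_scp(scp_percent):
    """Spare capacity contribution; equals 1/2 at SCP = 100 (assumed exponent 2)."""
    return 1.0 / (1.0 + (scp_percent / 100.0) ** 2)


def qor(rt_ms, scp_percent):
    """Joint metric; the divisor 3 is an assumption of this sketch."""
    return (2.0 * q_rt(rt_ms / 1000.0) + q_scp(scp_percent)) / 3.0


C_VALUES_MS = (0.5, 1, 5, 10)

# (SCP in percent, RT in ms for each value of C), copied from Tables I-III.
TABLES = {
    "COST 239": {
        "Div. Coding": (98, (4.8, 4.8, 4.8, 4.8)),
        "Source Rerout.": (90, (39.8, 41.8, 57.8, 77.8)),
        "p-cycles": (64, (26.1, 27.1, 35.1, 45.1)),
    },
    "U.S. Long Distance": {
        "Div. Coding": (106, (9.5, 9.5, 9.5, 9.5)),
        "Source Rerout.": (91, (79.7, 83.7, 115.7, 155.7)),
        "p-cycles": (107, (59.6, 60.6, 68.6, 78.6)),
    },
    "Synthetic": {
        "Div. Coding": (81, (0.9, 0.9, 0.9, 0.9)),
        "Source Rerout.": (100, (5.7, 8.2, 28.2, 53.2)),
        "p-cycles": (85, (2.8, 3.8, 11.8, 21.8)),
    },
}


def qor_by_scheme(network):
    """QoR of every scheme in a network, one value per entry of C_VALUES_MS."""
    return {scheme: [qor(rt, scp) for rt in rts]
            for scheme, (scp, rts) in TABLES[network].items()}


# Scheme with the highest QoR for every network and every value of C.
best = {
    network: [max(scores, key=lambda s: scores[s][i])
              for i in range(len(C_VALUES_MS))]
    for network, scores in ((n, qor_by_scheme(n)) for n in TABLES)
}
```

Under these assumptions, `best` names diversity coding in every entry, and its QoR is the same for all values of C because its RT column is constant in each table.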

VI. Conclusion 

In this paper, we employed the technique of diversity 
coding for providing a single link failure recovery technique 
in networks with arbitrary topologies. This is accomplished by 
finding groups of links that can be combined in basic diversity 
coding topologies, or in other words, mapping the arbitrary 
topology into efficient groups of basic diversity coding topolo- 
gies. This approach results in link failure restoration schemes 
that do not require message feedback or rerouting and there- 
fore are extremely efficient in terms of their restoration speed 
as shown via realistic calculations. Erasure coding techniques 
can be employed to extend this technique to recovery from 
multiple link and node failures. 

We would like to note that a number of recent publications 
discuss a network coding based link recovery technique [18], 
[19], similar to diversity coding [10], [11], where advantages 
of this technique over 1+1 APS in networks are illustrated. It 
should be noted that, unlike the source rerouting and p-cycles 
techniques, 1+1 APS is not considered a network restoration 
technique because 1+1 restoration is highly 
inefficient in terms of SCP. The comparison of a technique such 
as diversity coding for network restoration should be made 
against techniques such as source rerouting or p-cycles, as 
carried out in this paper. 